Abstract

We give a new "nested adds" circuit for implementing Shor's algorithm in linear width and quadratic 
depth on a nearest-neighbor machine. Our circuit combines Draper's transform adder with approximation 
ideas of Zalka. The transform adder requires small controlled rotations. We also give another version, 
with slightly larger depth, using only reversible classical gates. We do not know which version will 
ultimately be cheaper to implement. 

1 Introduction 

We describe a new quantum exponentiation circuit that obeys a "nearest-neighbor" constraint: we imagine
that qubits are arranged in a line, and we are only allowed to perform interactions between adjacent qubits.
Previous n-bit nearest-neighbor exponentiation circuits [FDH04, Van06] required either depth $O(n^3)$ or
superlinear width, but our construction has width $O(n)$ and depth $O(n^2)$. This new exponentiation circuit,
together with a nearest-neighbor quantum Fourier transform (QFT) [FDH04], gives a new circuit for Shor's
factorization algorithm [Sho99].

A number of people have constructed exponentiation circuits for general architectures (i.e., without the
nearest-neighbor restriction). See, for example, [VI05, VIL05, Van06] for recent summaries. Many of the
techniques used to reduce circuit depth do not appear to apply to a nearest-neighbor architecture.

Beauregard [Bea03] has given a simple exponentiation circuit using Draper's transform adder [Dra00].
The adder requires two QFTs together with some controlled rotations. Beauregard's circuit uses only
$2n + O(1)$ qubits, but has cubic depth; the dominant cost is $O(n^2)$ applications of the transform adder.
Fowler, Devitt, and Hollenberg [FDH04] modify Beauregard's circuit for use on a nearest-neighbor machine, and
they show that these modifications do not affect the dominant terms in the expression for size or depth.

Our contribution is a new approximate controlled modular multiplier with linear width and linear depth.
We use an idea of Zalka [Zal02] for building approximate multipliers. While we still multiply by performing
$O(n)$ additions, we only perform a constant number of large QFTs for each multiply. When we insert our
multiplier into the framework of Fowler et al., we obtain a nearest-neighbor exponentiation circuit with linear
width and quadratic depth.¹

We first set some notation and review prior work in Section 2. We describe our multiplier and the
resulting exponentiator in Section 3, and we discuss a version for general architectures in Section 5.

Following Fowler et al., we assume that any interaction between two adjacent qubits has unit cost. In
practice, some gates may be easier to implement than others. Our circuit requires small controlled rotations
that may prove expensive. Van Meter [Van06] discusses the error correction requirements for various adders
and suggests that the transform adder may not be useful in practice. In Section 4, we describe a version of
the circuit that is essentially classical and that does not require these small rotations. However, the depth
increases to $O(n^2 \log n)$. This is the same asymptotic depth achieved by Van Meter [Van06], but we require
only linear width.

*Center for Communications Research, 805 Bunn Drive, Princeton, NJ 08540. kutin@idaccr.org

¹ Zalka [Zal06] has recently pointed out this same idea of performing multiple additions framed by a single QFT, but he does
not work out any details or discuss the application to nearest-neighbor circuits.

2 Preliminaries



Our goal is to compute $w = g^e \bmod m$. Here g and m are n-bit constants, known to the classical compiler
that builds our circuit. The 2n-bit exponent e is in quantum memory.² Using a standard trick (see, for
example, [Bea03]), we can assume that only one bit of e at a time is stored in our quantum computer.
Writing $e = \sum_i 2^i e_i$, we have

\[ w = \prod_i \left(g^{2^i} \bmod m\right)^{e_i} \bmod m. \]

That is, we can decompose our exponentiation into 2n controlled multiplications. In each case we multiply
by 1 if the controlling bit $e_i$ is 0, and by a constant if $e_i$ is 1.
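As a purely classical illustration of this decomposition, the following minimal sketch walks through the exponent one bit at a time. The function name and the `n_bits` parameter are ours, chosen for illustration; the quantum circuit would hold each bit $e_i$ in a qubit rather than in a Python integer.

```python
def modexp_by_controlled_multiplies(g: int, e: int, m: int, n_bits: int) -> int:
    """Compute g**e mod m as a sequence of controlled multiplications by constants."""
    w = 1
    factor = g % m                      # g^(2^0) mod m, known at compile time
    for i in range(n_bits):
        e_i = (e >> i) & 1              # the single exponent bit used in this round
        if e_i:                         # multiply by the constant only if the control bit is 1
            w = (w * factor) % m
        factor = (factor * factor) % m  # next constant, g^(2^(i+1)) mod m
    return w
```

Each loop iteration corresponds to one controlled multiplication; with a 2n-bit exponent there are 2n of them.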

In Section 2.1, we describe how we reduce controlled modular multiplication to (roughly) n controlled
additions. In Section 2.2, we describe the addition routine we will use.

We refer the reader to Fowler et al. [FDH04] for useful building blocks for nearest-neighbor circuits. We
will use their "mesh" circuit for interleaving two registers. We will not use their controlled swap; instead, in
Section 2.3, we describe a simpler controlled swap for the case when one register is known to be 0.



2.1 Approximate Modular Multiplication 

We now present a scheme of Zalka [Zal02] for performing controlled modular multiplication. We wish to
compute

\[ r = abc \bmod m, \]

where a and m are n-bit constants, $b = \sum_i 2^i b_i$ is in n bits of quantum memory, and c is a control bit. We
can write

\[ r \equiv abc = \sum_i 2^i a b_i c \equiv \sum_i (b_i c)(2^i a \bmod m) \pmod{m}. \]

We can view this as repeated controlled modular addition; the numbers $x_i = 2^i a \bmod m$ are known at
compile time, and we have n control bits $y_i = b_i c$.
We define the partial sum

\[ s = \sum_i y_i x_i = r + qm. \]

The sum s is congruent to the answer r (mod m). Also, since s < nm, the quotient q is at most n. In
particular, we can write down q using only $\log_2 n$ bits.

Zalka's key idea is to approximate the desired answer r in two parallel steps. First, we compute s by
repeated controlled addition into an n-bit accumulator. Second, we approximate q: We choose some
$\ell_0 = O(\log n)$, and we compute $\tilde{q}$ using only the $\ell_0$ high bits of each $x_i$. More precisely, let
$\tilde{x}_i = 2^{n-\ell_0} \lfloor x_i/2^{n-\ell_0} \rfloor$. Then $\tilde{q} = \lfloor (\sum_i y_i \tilde{x}_i)/m \rfloor$. We can easily compute $\tilde{q}$ in depth
$O(\log^2 n)$. With high probability, $\tilde{q} = q$.

Once we have $\tilde{q} = \sum_i 2^i q_i$, subtracting $\tilde{q} m$ from s can be done with $\log_2 n$ additional controlled adds into
our accumulator (we subtract $2^i m$ controlled by $q_i$). Next, we must erase $\tilde{q}$; again, this takes only
$O(\log^2 n)$ depth. So, aside from a lower-order term, the cost of controlled modular multiplication is about
n controlled additions, or, equivalently, one controlled integer multiplication.
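To make the scheme concrete, here is a minimal classical sketch of the two parallel computations: the full sum s and the quotient estimate obtained from the truncated constants. The function name and return convention are ours and purely illustrative; the sketch works with unbounded Python integers rather than an n-bit accumulator, and it does not model the erasure of the quotient register.

```python
def approx_modmul(a, b, c, m, n, ell0):
    """Return (s - q_est*m, q_true, q_est) for the controlled product a*b*c mod m."""
    xs = [(a << i) % m for i in range(n)]            # x_i = 2^i a mod m (compile-time constants)
    ys = [((b >> i) & 1) & c for i in range(n)]      # y_i = b_i c (control bits)
    s = sum(y * x for y, x in zip(ys, xs))           # s = r + q m
    shift = n - ell0
    xs_trunc = [(x >> shift) << shift for x in xs]   # keep only the ell0 high bits of each x_i
    q_est = sum(y * x for y, x in zip(ys, xs_trunc)) // m
    return s - q_est * m, s // m, q_est
```

Because truncation only lowers the sum, the estimate never exceeds the true quotient; when it falls short, the returned value is still congruent to abc modulo m but is no longer reduced.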

There are other schemes that give modular multiplication circuits at a cost of three times the cost of
integer multiplication (see, for example, [Dhe98]). So it might seem that Zalka's idea would save only a
constant factor. However, Zalka's idea is conceptually simpler; without it, we might not have found the
linear-depth multiplier of Section 3.

² More generally, e has length $\alpha n$, and the error rate of the algorithm depends on $\alpha$. For simplicity we take $\alpha = 2$.

2.2 The Transform Adder



Most quantum arithmetic circuits are essentially classical in nature. Draper [Dra00] has given an addition
circuit that is inherently quantum. We briefly describe this circuit, and then discuss how to adapt it to the
nearest-neighbor setting.

Suppose we have an n-bit register containing $u = \sum_{j=0}^{n-1} u_j 2^j$. Then the QFT maps $|u\rangle$ to

\[ |\phi(u)\rangle = \bigotimes_{k=0}^{n-1} |\phi_k(u)\rangle, \]

where

\[ |\phi_k(u)\rangle = \frac{1}{\sqrt{2}}\left(|0\rangle + e^{2\pi i u/2^{k+1}} |1\rangle\right). \]



Note that $|\phi(u)\rangle$ is an unentangled state.

Suppose we want to add v to u. We can replace each bit $|\phi_j(u)\rangle$ by $|\phi_j(u+v)\rangle$; this is simply a Z-rotation
by an angle of $2\pi v/2^{j+1}$, so we can rotate each bit independently. To perform controlled addition, each of
these rotations is controlled by a bit c. We can then perform an inverse QFT to change $|\phi(u+v)\rangle$ to $|u+v\rangle$.
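The following small state-vector sketch illustrates the transform adder on a toy register. It is only a numerical illustration under our own conventions: the QFT is applied as a dense matrix rather than as a gate sequence, the function names are hypothetical, and the phase is written as one diagonal factor, which is equivalent (up to the ordering of the qubits) to the independent per-qubit Z-rotations described above.

```python
import cmath

def qft(state):
    """Apply the QFT to an amplitude vector (dense discrete Fourier transform)."""
    N = len(state)
    return [sum(state[u] * cmath.exp(2j * cmath.pi * u * k / N) for u in range(N)) / N ** 0.5
            for k in range(N)]

def inverse_qft(state):
    """Apply the inverse QFT to an amplitude vector."""
    N = len(state)
    return [sum(state[k] * cmath.exp(-2j * cmath.pi * u * k / N) for k in range(N)) / N ** 0.5
            for u in range(N)]

def transform_add(u, v, n, control=1):
    """Return the most likely measurement outcome after adding v to |u> (mod 2^n)."""
    N = 1 << n
    state = [0j] * N
    state[u] = 1
    state = qft(state)
    if control:
        # Diagonal phase exp(2*pi*i*v*k/N); it factors into one rotation per qubit.
        state = [amp * cmath.exp(2j * cmath.pi * v * k / N) for k, amp in enumerate(state)]
    state = inverse_qft(state)
    probs = [abs(amp) ** 2 for amp in state]
    return max(range(N), key=probs.__getitem__)
```

For example, `transform_add(12, 9, 4)` wraps around modulo 16, and passing `control=0` leaves the register unchanged.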

One way to view the QFT is that we have moved the information about u into the phase of the qubits.
To do a modular reduction and test the high bit of u, we first need to perform an inverse QFT. So, for a
naively designed modular exponentiation circuit, we perform $O(n^2)$ QFTs and inverse QFTs. Our main
result is a circuit design with only $O(n)$ QFTs.

Figure 1: Quantum Fourier transform of a 4-bit register on a nearest-neighbor machine. A circled j denotes a
Z-rotation by $2\pi/2^j$.



Fowler et al. [FDH04] give a nearest-neighbor form of the QFT. A 4-bit version is depicted in Figure 1.
After each controlled rotation, we swap the two bits involved, so every pair of bits can interact. (If we leave
out the swaps, we obtain the linear-depth QFT of Moore and Nilsson [MN01].) Note that we assign unit
cost to the controlled rotation together with the accompanying swap.

The size of this QFT circuit is $n^2/2 + O(n)$. We may be able to approximate the QFT and skip some
of the small rotations. On a general machine, this reduces the size to $O(n \log n)$, but on a nearest-neighbor
machine we still have to perform $\binom{n}{2}$ swaps.
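The gate counts above can be tallied with a short sketch. We assume the standard QFT structure, in which qubits at distance d interact through a rotation by $2\pi/2^{d+1}$; the function name and the `max_k` cutoff parameter are ours, introduced only to model an approximate QFT that drops the smallest rotations.

```python
def nn_qft_counts(n, max_k=None):
    """Count gates in a linear nearest-neighbor QFT.

    max_k: keep only rotations by 2*pi/2**k with k <= max_k (approximate QFT);
           None keeps every rotation.
    """
    rotations = 0
    for a in range(n):
        for b in range(a + 1, n):
            k = b - a + 1                    # rotation by 2*pi/2**k between qubits a and b
            if max_k is None or k <= max_k:
                rotations += 1
    swaps = n * (n - 1) // 2                 # every pair must still cross, even if its rotation is skipped
    return {"hadamards": n, "rotations": rotations, "swaps": swaps}
```

With the rotation and its accompanying swap counted as one unit, the exact circuit has n + n(n-1)/2 gates, which matches the $n^2/2 + O(n)$ figure; with a cutoff of order log n, the rotation count drops to $O(n \log n)$ while the swap count does not change.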



2.3 Pseudo-Toffolis and Controlled Swaps

A frequently useful building block for our circuit is a Toffoli gate, or doubly-controlled NOT: $v \mathrel{\oplus}= uw$. A
cascade of Toffoli gates through a k-bit register has depth 2k. However, if we use the "pseudo-Toffoli" gate
of Figure 2, the depth of the cascade can be reduced to k. See [BBC+95] for an equivalent pseudo-Toffoli
gate.

Figure 2: Pseudo-Toffoli gate $v \mathrel{\oplus}= uw$. We also change the phase when $|uvw\rangle = |011\rangle$.

Figure 3: Swap of 4-bit registers X and Y controlled by c in depth 10. We assume that Y is initialized to 0.

The idea of Figure 2 is that we correctly set v to $v \oplus uw$, but we change the phase when $|uvw\rangle = |011\rangle$.
Normally this would be an unacceptable side effect, but there are two cases where we are okay: First, we
may plan to undo this computation and fix the phase later. Second, we may know that the problem input
is forbidden for some reason.

For example, suppose we want to swap two n-bit registers X and Y controlled by a bit c. Suppose further
that Y is initialized to 0. Then we can build a pseudo-Toffoli cascade as in Figure 3. Since each Toffoli
target is known to be 0, there will be no phase shift. The depth is $2n + 2$.
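The sketch below models this controlled swap on classical basis states while tracking the sign picked up by each pseudo-Toffoli. Two points are our assumptions rather than facts taken from the figures: we take the phase flip to occur on the input with u = 0, v = 1, w = 1 (target bit set), and we realize the swap as one pseudo-Toffoli followed by one CNOT per bit, which is one natural construction when Y starts at 0. The function names are hypothetical.

```python
def pseudo_toffoli(u, v, w):
    """Model of v ^= u & w on a basis state, with sign -1 on the assumed bad input (0, 1, 1)."""
    sign = -1 if (u, v, w) == (0, 1, 1) else 1
    return (u, v ^ (u & w), w), sign

def controlled_swap_into_zero(c, x_bits):
    """Swap X into an all-zero register Y when c == 1; returns (x, y, accumulated sign)."""
    x = list(x_bits)
    y = [0] * len(x)
    phase = 1
    for j in range(len(x)):
        (_, y[j], _), sign = pseudo_toffoli(c, y[j], x[j])  # y_j ^= c & x_j; the target y_j is 0
        phase *= sign
        x[j] ^= y[j]                                        # CNOT: clear x_j if it was copied to y_j
    return x, y, phase
```

Since every pseudo-Toffoli target is 0 on entry, the bad input never arises and the accumulated sign stays +1 for every value of c and X.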



3 Nested Adds 

We now describe our main result, the "nested adds" multiplier. We begin by describing a controlled multiplier 
with linear width and depth; we then explain how to modify it to be a modular multiplier. We conclude 
with an exponentiation circuit with linear width and quadratic depth. 



3.1 Nested Controlled Addition 

As noted in Section 2.1, we can view controlled multiplication as repeated controlled addition. In this section,
we build a repeated controlled adder. We have an n-bit register Z, initialized to some value z, and an n-bit
register Y of control bits $y_i$. When the circuit concludes, we want Z to contain

\[ z + \sum_i y_i x_i \bmod 2^n, \]

where the values $x_i$ are n-bit constants. In the next section, we will convert this circuit to a modular
multiplier.

It is clear that n-bit addition controlled by a single bit $y_i$ requires linear depth on a nearest-neighbor
machine; the control bit can affect all n bits of Z, so we need linear time to move (or pseudocopy) it from one 
end to the other. One might at first think that performing n controlled additions would require quadratic 
depth. However, if we use the transform adder, we can nest the additions. 

The basic structure of the circuit is depicted in Figure 4. We begin by performing the QFT on Z, in
depth $2n - 3$. Next, we take each bit of Y successively and swap it with each bit of Z. As we swap $Y_i$
with $Z_j$, we also rotate $Z_j$ controlled by $Y_i$; the rotation amount depends on $x_i$. The idea is that we are
adding in $x_i$ by rotating each bit of Z by the proper amount; all of these rotations commute, so the order is
unimportant. This portion has depth $2n - 1$; when it concludes, we have effectively swapped the Z and Y
registers.

Figure 4: Schematic for the "nested adds" repeated controlled adder.
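The commutation argument can be checked with a small phase-bookkeeping sketch. Instead of simulating amplitudes, we store for each qubit j of Z the integer u mod $2^{j+1}$, which determines its phase angle $2\pi u/2^{j+1}$ after the QFT. The function name and the optional `rng` argument (used only to scramble the order of the rotations) are ours.

```python
import random

def nested_adds(z, ys, xs, n, rng=None):
    """Track per-qubit phases of Z while every (Y_i, Z_j) pair meets once."""
    # phase[j] holds u mod 2^(j+1), i.e. the angle 2*pi*u/2^(j+1) carried by qubit j.
    phase = [z % (1 << (j + 1)) for j in range(n)]
    pairs = [(i, j) for i in range(len(ys)) for j in range(n)]
    if rng is not None:
        rng.shuffle(pairs)                   # the rotations commute, so any order works
    for i, j in pairs:
        if ys[i]:                            # rotation of Z_j controlled by Y_i
            phase[j] = (phase[j] + xs[i]) % (1 << (j + 1))
    return phase[n - 1]                      # the value the inverse QFT would recover
```

Running it with and without shuffling gives the same register value, $z + \sum_i y_i x_i \bmod 2^n$.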

Next, we perform the inverse QFT on Z. This again has depth $2n - 3$. Finally, we move Y back to
where it started in depth $2n - 1$.

As described, the total depth would be $8n - 8$. However, as shown in Figure 4, the inverse QFT nests
nicely with the swaps with Y. We can start the inverse QFT at time $3n - 5$, and we can start the final
swaps at time $4n - 2$. The total depth is only $6n - 4$.
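The depth bookkeeping can be reproduced in a few lines. In this sketch we assume that time steps are numbered from 1, so that a stage starting at time t and lasting d steps finishes at time t + d - 1; the function name is ours.

```python
def nested_adder_depths(n):
    """Depth of the repeated adder with and without nesting the final stages."""
    qft = 2 * n - 3                          # QFT or inverse QFT on Z
    sweep = 2 * n - 1                        # moving Y through Z, in either direction
    sequential = 2 * qft + 2 * sweep         # the four stages run back to back
    final_swaps_start = 4 * n - 2            # start time quoted in the text
    nested = final_swaps_start + sweep - 1   # last time step used by the final swaps
    return {"sequential": sequential, "nested": nested}
```

For every n this returns 8n - 8 for the sequential schedule and 6n - 4 for the nested one.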

If we can assume z is a constant, then we can replace the initial QFT with a single time-slice of n unitary
transformations³ on Z. The depth is reduced to $4n - 1$. See Section 3.5 for the reasons why we might want
to allow nonzero z. For the remainder of this paper, we will assume that z is a constant, and that we can
skip the initial QFT.

3.2 Nested Controlled Modular Addition 

To turn the above circuit into a modular multiplier, we follow the procedure described in Section 2.1. We
compute the sum $s = \sum_i y_i x_i$, which is congruent to the desired answer r modulo m. (Since we know our final
answer has n bits, we need only compute the low n bits of s.) Simultaneously, we compute the approximate
quotient $\tilde{q}$. We then subtract $\tilde{q} m$ from our main register. Finally, we erase $\tilde{q}$.

We compute $\tilde{q}$ in an $\ell$-bit register Q, which we locate between Y and Z. We take $\ell = \ell_0 + \log_2 n$, so we
have room to write the $(n + \log_2 n)$-bit sum $\sum_i y_i \tilde{x}_i$ (which has 0 in the low-order $n - \ell_0$ bits).

We need to initialize the low $\ell_0$ bits of Q. If we have nonconstant data in Z, we could pseudocopy $\ell_0$
bits of it to Q; this is not expensive, but it might be costly to erase Z when we are done. In our case, we
will initialize Z to a constant z, and Q to the high-order $\ell_0$ bits of z.

We pass the bits of Y past Q and then Z. We compute the high bits of $z + \sum_i y_i \tilde{x}_i$ in Q, and we compute
$z + \sum_i y_i x_i \bmod 2^n$ in Z.

As soon as the last bit has passed through Q, we compute $\tilde{q}$. For $k = \log_2 n$ down to 1, we first
subtract $2^{k-1} m$ from Q by doing a unary rotation on each bit. Next, we do an inverse QFT in depth at
most $2\ell - 1$; the top bit of Q is now a control bit indicating whether we should have subtracted $2^{k-1} m$ or
not. We label that bit $q_k$ and think of it as no longer part of Q. We now do a QFT on the remaining bits of
Q, and then move $q_k$ through Q; this adds $2^{k-1} m$ back if necessary, and also positions $q_k$ to go through Q.

At step k, we perform an inverse QFT on $\ell_0 + k$ bits and a QFT on $\ell_0 + k - 1$ bits, and then we move
$q_k$ through Q. The depth is $4(\ell_0 + k) - 3$. The total depth, summing from k = 1 to $\log_2 n$, is

\[ 2\ell^2 - 2\ell_0^2 + O(\ell) = 2(2\ell - \log_2 n)\log_2 n + O(\ell). \qquad (1) \]

We use the $q_k$ bits as control bits, subtracting $2^{k-1} m$ as needed from s. When we are done, the answer r
is in Z. When we pass the $q_k$ bits back up, we again take time given by (1) to uncompute $\tilde{q}$. (Alternatively,
we could move all of Q past Z and then uncompute $\tilde{q}$.)

³ For example, when z = 0, we apply a Hadamard to each qubit of Z.

We subtract z from Z after computing r. See Section 3.5 for details.
The total circuit depth for repeated controlled addition is

\[ 4n + 4(2\ell - \log_2 n)\log_2 n + O(\log n). \]

The width is $2n + \ell + O(1)$.
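The sum behind the quotient-computation depth can be checked numerically. The sketch below adds up the per-step depth $4(\ell_0 + k) - 3$ for k = 1 to $\log_2 n$ and compares it with the closed form; the function names are ours, and `log_n` stands for $\log_2 n$, assumed to be an integer.

```python
def quotient_depth(ell0, log_n):
    """Sum of the per-step depths 4*(ell0 + k) - 3 for k = 1, ..., log_n."""
    return sum(4 * (ell0 + k) - 3 for k in range(1, log_n + 1))

def quotient_depth_closed_form(ell0, log_n):
    """Leading term 2*(2*ell - log_n)*log_n with its exact lower-order correction."""
    ell = ell0 + log_n
    return 2 * (2 * ell - log_n) * log_n - log_n
```

The two agree exactly, so the only correction to the leading term $2(2\ell - \log_2 n)\log_2 n$ is a lower-order $-\log_2 n$. Computing and later uncomputing the quotient doubles this cost, which accounts for the factor 4 in the total depth.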

3.3 Controlled Modular Multiplication 

So far, we have assumed that the n control bits are present at the start of the computation. To complete
our modular multiplier, we need to explain how to start from the multiplicand b and the overall control bit
c and produce the control bits $y_i = b_i c$. Also, since we want an in-place multiplier, we need to explain how
to erase b when we are done (if c = 1).

It is easy to perform the desired steps in linear depth, given the linear-depth out-of-place modular
multiplication circuit described above. The challenging part is to keep the depth as low as possible. Our
solution has depth

\[ 11n + 6(2\ell - \log_2 n)\log_2 n + O(\log n), \]

width

\[ 3n + 2\ell + 1, \]

and size

\[ 5n^2 + O(n \log n), \]

and is depicted in Figure 5. We briefly describe the basic features of the circuit.

We have three n-bit registers (labeled B, Y, and Z), two $\ell$-bit registers (labeled $Q_Y$ and $Q_Z$), and one
control bit c. Initially B contains b and the other four registers contain 0. When the circuit concludes, B
contains b (when c = 0) or ab (when c = 1) and the other four registers contain 0.

To start, we have $Q_Y$, then B and Y interleaved (i.e., we have $B_0, Y_0, B_1, Y_1, \ldots, B_{n-1}, Y_{n-1}$), and
then c, $Q_Z$, and Z. When the circuit completes, we have Y, then $Q_Y$, then B interleaved with Z, then c,
and finally $Q_Z$. So, except for the location of c, the bits have been flipped upside-down. (See Section 3.4 for
the reason we end with c in a different place.)

We first move c through the interleaved B and Y, performing controlled swaps. If the contents of B and
Y were wholly general, this process would have depth $4n$, but because we know Y contains 0, we can use
pseudo-Toffolis (see Section 2.3), and the depth is only $2n + 2$. After the controlled swaps, we unmesh B
and Y.

Next, we multiply Y by a and write the result to Z. These gates are depicted in blue in Figure 5. We
use $Q_Z$ as a scratch register for computing $\tilde{q}$. We load a constant z into Z (and its high bits into $Q_Z$), then
we perform the circuit described in the previous section, and finally we erase $Q_Z$ and unload the constant z.
When this portion concludes, if c = 0, then B contains b and Y and Z contain 0. If c = 1, then B contains
0, Y contains b, and Z contains ab.

We now perform the gates depicted in red in Figure 5. We undo a multiplication of Z by $a^{-1}$, writing the
result into Y. The red circuit is a backwards, upside-down version of the blue circuit. When we are done, Y
contains 0. If c = 0, then B contains b and Z contains 0; if c = 1, then B contains 0 and Z contains ab.

Finally, we mesh B and Z and perform the controlled swap in reverse. (Again, we can use pseudo-Toffolis
to reduce the depth to $2n + 2$.) We write b or ab to B, and we write 0 to Z, as desired.
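The register contents at each stage can be traced with ordinary integers. This is only a classical sketch of the data flow: the function name is ours, the scratch registers $Q_Y$ and $Q_Z$ and the constant z are omitted, b is assumed to be less than m, and a is assumed to be invertible modulo m (the three-argument `pow` with exponent -1 needs Python 3.8 or later).

```python
def controlled_inplace_modmul(a, b, c, m):
    """Trace registers (B, Y, Z) through the controlled in-place multiply by a mod m."""
    B, Y, Z = b, 0, 0
    if c:                        # controlled swap of B and Y (Y is known to be 0)
        B, Y = Y, B
    Z = (Z + a * Y) % m          # blue part: out-of-place multiply, Z = a*Y mod m
    a_inv = pow(a, -1, m)        # classical constant a^(-1) mod m
    Y = (Y - a_inv * Z) % m      # red part: undo a multiplication of Z by a^(-1), clearing Y
    if c:                        # controlled swap of B and Z in reverse
        B, Z = Z, B
    return B, Y, Z
```

In both cases Y and Z end at 0, and B holds either b (when c = 0) or ab mod m (when c = 1).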

Note that part of the red circuit overlaps part of the blue circuit. In particular, we uncompute the first
$\tilde{q}$ while computing the second. This is why the second-order term in the depth is $6(2\ell - \log_2 n)\log_2 n$ rather
than $8(2\ell - \log_2 n)\log_2 n$.

We must swap B and Y before we can interleave B and Z. If our bits were arranged in a ring, we could
bring B around from the other side; this would reduce the depth by about n and the size by about $n^2$. One
could construct a more symmetric version of Figure 5 by moving B down to the bottom between the blue
and red portions, but this increases the size by about $n^2$ without changing the depth.

3.4 Exponentiation

We recall from Section 2 that our goal is to perform 2n controlled in-place modular multiplications. We
will repeatedly apply the circuit of Section 3.3. Since that circuit leaves the machine "upside-down," we
alternate between applying the circuit right-side-up and upside-down.

Let $e_i$ denote the control bit in the ith round. We add one additional bit to the circuit of Section 3.3.
Just before we start the swap of B and Z controlled by $e_i$, we create our next control bit $e_{i+1}$. Then, as soon
as we have swapped two bits of the interleaved B and Z controlled by $e_i$, we swap them again controlled by
$e_{i+1}$ (viewing them as B and Y for the next round). We can thus overlap these two controlled swaps; we
reduce the depth of each round to only $9n + O(\log^2 n)$.

There may be a technicality here because of the order in which we perform measurements. After we are
done using $e_i$, we measure it, and we may need to rotate $e_{i+1}$ based on the observed value of $e_i$. We will
assume that this is not a problem in practice. If necessary, we could generate $O(\sqrt{n})$ control bits at a time
and use them; we would still have a depth of roughly 9n and a width of roughly 3n.

Our circuit has depth

\[ 18n^2 + 12n(2\ell - \log_2 n)\log_2 n + O(n \log n), \]

width

\[ 3n + 2\ell + 2, \]

and size

\[ 10n^3 + O(n^2 \log n). \]

Here $\ell = O(\log n)$ is chosen to control the error rate of our computation of $\tilde{q}$. See the next section for details.
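For a feel of the magnitudes, the leading terms of these expressions can be evaluated directly. The sketch below drops all lower-order terms, so its outputs are rough estimates rather than exact gate counts; the function name is ours, and n is assumed to be a power of two so that $\log_2 n$ is an integer.

```python
def exponentiation_costs(n, ell):
    """Leading-order depth, width, and size of the nearest-neighbor exponentiation circuit."""
    log_n = n.bit_length() - 1                                # log2(n), assuming n is a power of two
    per_round_depth = 9 * n + 6 * (2 * ell - log_n) * log_n   # one controlled multiplication
    rounds = 2 * n                                            # one round per exponent bit
    return {
        "depth": rounds * per_round_depth,                    # 18n^2 + 12n(2l - log n) log n
        "width": 3 * n + 2 * ell + 2,
        "size": 10 * n ** 3,
    }
```

For instance, `exponentiation_costs(1024, 30)` multiplies the per-round depth of 12216 by the 2048 rounds.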

3.5 Error Analysis 

In this section we address two questions. First, how should we choose $\ell$? Second, how does filling Z with a
random value z improve our error analysis?

We perform 4n modular multiplications. For each of these, we add n quantities to compute $\tilde{q}$. There
are thus $4n^2$ additions where we might make a mistake. Given random addends, the probability of an error
propagating across a window of length $\ell_0$ is $2^{-\ell_0}$. Our probability of making an error is therefore at most

\[ 4n^2 2^{-\ell_0} = 2^{2\log_2 n + 2 - \ell_0}. \]

To reduce our error probability to a constant, we should take $\ell_0 = 2\log_2 n + O(1)$, or

\[ \ell = \ell_0 + \log_2 n = 3\log_2 n + O(1). \]
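As a worked instance of this bound, the following sketch finds the smallest window length meeting a target error probability under the random-addends heuristic. The function name and the `eps` parameter are ours; the sketch ignores the additional bits needed for the random constant z.

```python
import math

def window_bits(n, eps):
    """Smallest integer ell0 with 4*n^2*2^(-ell0) <= eps, and the register size ell."""
    ell0 = math.ceil(math.log2(4 * n * n / eps))
    ell = ell0 + (n - 1).bit_length()        # ell = ell0 + ceil(log2 n)
    return ell0, ell
```

With n = 1024 and a target of 0.3, this gives a window of 24 bits and a 34-bit register, in line with $\ell_0 = 2\log_2 n + O(1)$ and $\ell = 3\log_2 n + O(1)$.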

What does an error rate of $\epsilon$ mean in the quantum setting? Instead of attaining the desired state $|\phi\rangle$,
we attain a state $|\tilde{\phi}\rangle = \alpha|\phi\rangle + \eta|\psi\rangle$, where the error state $|\psi\rangle$ is orthogonal to $|\phi\rangle$ and $|\eta|^2 < \epsilon$. A standard
calculation yields that the distance between the probability distributions on measurements for $|\phi\rangle$ and $|\tilde{\phi}\rangle$
is at most $\epsilon$. Note that an error may mean that we fail to erase scratch space correctly, invalidating future
rounds, but this is irrelevant to the analysis.

The assumption above of "random addends" may not be reasonable. Zalka [Zal02] discusses this problem:
citing a "private objection" by Manny Knill, Zalka writes that "mathematically (and therefore very
cautiously) inclined people have questioned the validity of this assumption." Our solution is to fill our register
with a random constant z. (We can use the same z each time, or we can choose a different one for each
multiplication.) The expected probability of an error in computing $\tilde{q}$ over all our choices of z is the desired
$\epsilon$.

However, the constant z introduces another place where errors can occur. When we subtract z at the
end, we do not perform a modular subtraction. If we ensure $z < m/2^t$, the probability of an error at some
point is $4n 2^{-t}$. We therefore take $t = \log_2 n + O(1)$. Note that this increases $\ell_0$ to $3\log_2 n + O(1)$ and $\ell$ to
$4\log_2 n + O(1)$.

4 A Classical Version



The circuit of this paper requires numerous small controlled rotations. We now show that a variant of these
ideas gives a reversible classical approximate exponentiation circuit with depth $O(n^2 \log n)$ and size $O(n^3)$.

We still organize exponentiation as repeated multiplication and multiplication as repeated addition. On
a general architecture, we can attain depth $O(n^2 \log n)$ using a logarithmic-depth adder [DKRS06]. On
a nearest-neighbor machine, we cannot perform controlled addition in sublinear depth. As in our main
construction, we nest different controlled additions to obtain an amortized depth of $O(\log n)$ per addition.

We return to the setting of Section 3.1. We have an n-bit register Z (initialized to some value z) and an
n-bit register Y. We wish to write to Z the quantity $z + \sum_i x_i y_i \bmod 2^n$; here the $y_i$ are bits of y and the
$x_i$ are n-bit constants.

We follow the general structure of Figure 4. Since we wish to build a classical circuit, we no longer
perform any QFTs. Instead, we choose some $t = O(\log n)$, and we write $k = \lceil n/t \rceil$. We divide Z into k
blocks of size t; each "wire" of Z in Figure 4 represents a single block $Z_j$. (Each wire of Y is still a single
bit $y_i$.) We also divide each $x_i$ into blocks $x_i^{(j)}$ of length t.

We divide this portion of the circuit into $n + k - 1$ rounds. In round r, $y_{r-j}$ crosses $Z_j$ for all j (as long
as $0 \le j < k$ and $0 \le r - j < n$). At this time, we add the number

\[ A_r = \sum_j 2^{tj} y_{r-j} x_{r-j}^{(j)} \]

into Z, where $x_i^{(j)}$ is the jth t-bit block of $x_i$. Note that

\[ \sum_{r=0}^{n+k-2} A_r = \sum_{i=0}^{n-1} x_i y_i \]

as desired. Also note that, in round r, the control bit $y_{r-j}$ controlling the jth block of $A_r$ is next to $Z_j$ in
memory.
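The diagonal schedule can be checked with a short classical sketch. We assume here that block j of a constant occupies bit positions tj through tj + t - 1, so that each block contributes with weight $2^{tj}$; the function name is ours.

```python
def diagonal_rounds(xs, ys, n, t):
    """Return the per-round addends A_r for r = 0, ..., n + k - 2."""
    k = -(-n // t)                           # k = ceil(n / t) blocks
    mask = (1 << t) - 1
    rounds = []
    for r in range(n + k - 1):
        a_r = 0
        for j in range(k):
            i = r - j                        # index of the control bit next to block Z_j
            if 0 <= i < n:
                block = (xs[i] >> (t * j)) & mask
                a_r += (ys[i] * block) << (t * j)
        rounds.append(a_r)
    return rounds
```

Every pair (i, j) is visited exactly once, in round r = i + j, so the round totals add up to $\sum_i x_i y_i$.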

To add $A_r$ into Z, we first do k parallel controlled adds, one for each block. We erase our work, but we
write down the high bit $h_j$ for each block. We hope that we correctly compute each $h_j$; this requires that
no carry propagate through an entire block.

Next, we again do k parallel controlled adds, but this time, for the jth block, we use $h_{j-1}$ as an incoming
carry bit. If the $h_j$ bits are all correct, we correctly add $A_r$ into Z.
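The two-pass addition can be sketched classically as follows. This is an illustration of the carry-guessing idea only, with a function name of our choosing; it does not model reversibility, the control bits, or the later erasure of the $h_j$ bits.

```python
def blocked_add(z, a, k, t):
    """Add a into z (mod 2^(k*t)) block by block, using guessed carries between blocks."""
    mask = (1 << t) - 1
    zb = [(z >> (t * j)) & mask for j in range(k)]
    ab = [(a >> (t * j)) & mask for j in range(k)]
    # Pass 1: carry out of each block, computed without any incoming carry.
    h = [(zb[j] + ab[j]) >> t for j in range(k)]
    # Pass 2: redo each block sum, taking h[j-1] as the incoming carry.
    out = 0
    for j in range(k):
        carry_in = h[j - 1] if j > 0 else 0
        out |= ((zb[j] + ab[j] + carry_in) & mask) << (t * j)
    return out
```

The result is correct whenever no carry has to travel through an entire block. A failing case is 0x00F8 + 0x0009 with four 4-bit blocks: the carry out of block 0 should ripple through block 1 (whose local sum is all ones), but the guessed carry out of block 1 is 0, so block 2 is left unchanged.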

Finally, we erase the $h_j$ bits. We compare $Z_j$ with $y_{r-j} x_{r-j}^{(j)}$ to determine if an overflow occurred; if so,
$h_j$ must have been 1. We then exchange each $y_{r-j}$ bit with $Z_j$ to move the control bits into position for the
next round.

Each of these steps can be performed with a ripple-carry adder [CDKM04]; the depth is Ct for a small
constant C. We need 2k extra bits: the high bits $h_j$ and one scratch bit for each ripple.⁴

To do modular multiplication, we use the same scheme as in our main construction: we estimate $\tilde{q}$ on
the side. The error analysis is the same. Note that we also perform $O(n^3)$ controlled additions of size t; the
probability that some $h_j$ bit is wrong at some point is thus $O(n^3 2^{-t})$. We choose $t = O(\log n)$ to reduce this
probability to a small constant.

We can use the pseudo-Toffoli gates described in Section 2.3 to reduce the depth. It is interesting to note
that, for the ripple-carry adder, we do not perform exactly the same gates when we undo the computation,
but the "bad" case for the pseudo-Toffoli happens on the forward ripple if and only if it happens on the
reverse ripple, so we fix our phase errors correctly.

The circuit depth is $O(n^2 \log n)$. The exact constant depends on the choice of $\ell$ and t and on precisely
how we perform the ripple-carry additions.

⁴ We cannot use the ripple-carry adder of Takahashi and Kunihiro [TK05]. Their adder eliminates the scratch bit, but it
does not work on a nearest-neighbor machine.

5 General Architectures



The "nested adds" multiplier of Section 3 can be simplified in several ways if implemented on a machine
without a nearest-neighbor restriction:

• The controlled swaps at the start and end of the multiplier can be performed in logarithmic depth.
We fan the control bit c out into an empty n-bit register, perform n parallel swaps, and fan c back in.
Note that we always have an empty n-bit register available.

• The mesh and unmesh operations and any register swaps (all in black in Figure 5) are unnecessary.
This reduces the depth by about n and the size by about $2n^2$.

• The QFT and inverse QFT can be approximated. This does not improve the depth, but the size of
each decreases from about $n^2/2$ to $O(n \log n)$.
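The logarithmic-depth controlled swap in the first item can be sketched as follows. We assume a doubling fan-out, in which every bit that already holds a copy of c writes one new copy per round; this is one common way to realize a fan-out and is not necessarily the construction intended above. Both function names are ours.

```python
def fanout_rounds(n):
    """Rounds needed for c to fill an empty n-bit register with copies of itself by doubling."""
    holders, rounds = 1, 0                   # initially only c holds the value
    while holders < n + 1:                   # target: c plus n copies
        holders = min(2 * holders, n + 1)    # each holder writes one new copy per round
        rounds += 1
    return rounds

def fanned_controlled_swap(c, x_bits, y_bits):
    """Swap X and Y bitwise, each position controlled by its own copy of c."""
    copies = [c] * len(x_bits)               # the result of the fan-out
    new_x = [y if ctl else x for ctl, x, y in zip(copies, x_bits, y_bits)]
    new_y = [x if ctl else y for ctl, x, y in zip(copies, x_bits, y_bits)]
    return new_x, new_y
```

The n swaps act on disjoint bits and can run in a single time step, so the depth is dominated by the fan-out and the matching fan-in, each of which takes about $\log_2 n$ rounds.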

With these changes, the modular multiplier has depth $6n + 6(2\ell - \log_2 n)\log_2 n + O(\log n)$, width $3n + 2\ell + 1$,
and size $2n^2 + O(n \log n)$. Taking $\ell = 3\log_2 n + O(1)$ as in Section 3.5, we get an exponentiation circuit with
depth

\[ 12n^2 + 60n\log_2^2 n + O(n \log n), \]

width

\[ 3n + 6\log_2 n + O(1), \]

and size

\[ 4n^3 + O(n^2 \log n). \]

We could further reduce the depth by using a parallel version of the QFT [CW00], but each multiply
would still have depth at least $5n + O(\log^2 n)$. We could also consolidate the registers $Q_Y$ and $Q_Z$; we would
get a slight increase in depth and a slight decrease in width.